Exploring text datasets by visualizing relevant words

Authors

  • Franziska Horn
  • Leila Arras
  • Grégoire Montavon
  • Klaus-Robert Müller
  • Wojciech Samek

Abstract

When working with a new dataset, it is important to first explore and familiarize oneself with it, before applying any advanced machine learning algorithms. However, to the best of our knowledge, no tools exist that quickly and reliably give insight into the contents of a selection of documents with respect to what distinguishes them from other documents belonging to different categories. In this paper we propose to extract ‘relevant words’ from a collection of texts, which summarize the contents of documents belonging to a certain class (or discovered cluster in the case of unlabeled datasets), and visualize them in word clouds to allow for a survey of salient features at a glance. We compare three methods for extracting relevant words and demonstrate the usefulness of the resulting word clouds by providing an overview of the classes contained in a dataset of scientific publications as well as by discovering trending topics from recent New York Times article snippets.
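The idea of picking out words that distinguish one class of documents from the rest can be illustrated with a minimal sketch. The abstract does not name the three extraction methods it compares, so the scoring rule below is an assumption made purely for illustration: a word's score is the share of target-class documents containing it minus the share of all other documents containing it. The function names `tokenize` and `relevant_words` and the toy documents are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relevant_words(docs, labels, target, top_n=5):
    """Rank words by how much more often they appear in `target` documents.

    Illustrative scoring rule (an assumption, not one of the paper's methods):
    score = share of target docs containing the word
            - share of all other docs containing the word.
    """
    in_class = [set(tokenize(d)) for d, l in zip(docs, labels) if l == target]
    others = [set(tokenize(d)) for d, l in zip(docs, labels) if l != target]
    if not in_class:
        raise ValueError(f"no documents with label {target!r}")

    # Document frequencies: in how many documents each word occurs.
    in_df = Counter(w for doc in in_class for w in doc)
    out_df = Counter(w for doc in others for w in doc)

    scores = {}
    for word, count in in_df.items():
        out_share = out_df[word] / len(others) if others else 0.0
        scores[word] = count / len(in_class) - out_share

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    docs = [
        "the neural network learns features",
        "the neural model trains fast",
        "the stock market fell today",
        "the market rallied on earnings",
    ]
    labels = ["ml", "ml", "fin", "fin"]
    for word, score in relevant_words(docs, labels, "ml"):
        print(f"{word}: {score:.2f}")
```

In this toy setting a word such as "the", which occurs in every document, scores zero, while a word confined to the target class scores highest. Scores of this kind could then serve as the size weights in a word cloud.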

Similar resources

Discovering topics in text datasets by visualizing relevant words

When dealing with large collections of documents, it is imperative to quickly get an overview of the texts’ contents. In this paper we show how this can be achieved by using a clustering algorithm to identify topics in the dataset and then selecting and visualizing relevant words, which distinguish a group of documents from the rest of the texts, to summarize the contents of the documents belon...

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to many kinds of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...

User-Directed Sentiment Analysis: Visualizing The Affective Content Of Documents

Recent advances in text analysis have led to finer-grained semantic analysis, including automatic sentiment analysis, the task of measuring documents, or chunks of text, based on emotive categories, such as positive or negative. However, considerably less progress has been made on efficient ways of exploring these measurements. This paper discusses approaches for visualizing the affective conte...

Visualizing Spoken Interaction

This paper introduces and explores the use of IVEE, a technique for visualization and dynamic database queries applied to a database of transcriptions from various types of linguistic interaction. The use of IVEE is illustrated by presentation of data on vocabulary richness, verbal dominance, hesitance, uncertainty and liveliness. 1. Visualizing and exploring complex datasets with dynamic queri...

Journal title:
  • CoRR

Volume: abs/1707.05261

Publication date: 2017